[Fix] Target Allocator Manager quits if the initial sync fails #241

Merged
merged 1 commit into from
Oct 15, 2024


okankoAMZ

Description:
If the CloudWatch Agent Collector pod starts before the Target Allocator pod, its initial ping to the Target Allocator fails. This failure ends the Agent's target allocator thread, so Prometheus metrics are lost. Currently, the only fix when this happens is to restart the pods.

To fix this, I removed the failure return from Start in the TA Manager thread, so the TA Manager keeps retrying.

This ensures no metrics are lost. It also adds no extra cost: if the scrape_config hash is the same as the previous one (savedHash), the sync function returns immediately. If the ping keeps failing, the hash stays equal to savedHash, which is 0, so no extra computation is needed.
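For illustration, here is a minimal sketch of the behavior described above. It is not the actual TA Manager code. `Manager`, `fetchFunc`, `flakyFetch`, and the `applied` counter are hypothetical names, a bounded loop stands in for the real periodic sync, and FNV hashing is an assumption made only to keep the example self-contained.

```go
package main

import (
	"errors"
	"hash/fnv"
)

// fetchFunc is a hypothetical stand-in for the request to the Target Allocator.
type fetchFunc func() ([]byte, error)

// Manager is a simplified, hypothetical model of the TA Manager.
type Manager struct {
	fetch     fetchFunc
	savedHash uint64 // stays 0 until a scrape config has been applied
	applied   int    // number of times a new config was applied
}

// sync fetches the scrape config and applies it only when its hash changed.
func (m *Manager) sync() error {
	cfg, err := m.fetch()
	if err != nil {
		return err // savedHash is left untouched on failure
	}
	h := fnv.New64a()
	h.Write(cfg)
	sum := h.Sum64()
	if sum == m.savedHash {
		return nil // unchanged config: return immediately
	}
	m.savedHash = sum
	m.applied++
	return nil
}

// Start runs the initial sync and then `rounds` further attempts.
// A failed initial sync is counted instead of returned, so retries still run.
func (m *Manager) Start(rounds int) (failures int) {
	if err := m.sync(); err != nil {
		failures++ // returning err here would end the thread
	}
	for i := 0; i < rounds; i++ {
		if err := m.sync(); err != nil {
			failures++
		}
	}
	return failures
}

// flakyFetch fails the first failFirst calls, then returns cfg.
func flakyFetch(failFirst int, cfg string) fetchFunc {
	calls := 0
	return func() ([]byte, error) {
		calls++
		if calls <= failFirst {
			return nil, errors.New("target allocator not reachable")
		}
		return []byte(cfg), nil
	}
}
```

With a fetch that fails twice, `(&Manager{fetch: flakyFetch(2, "scrape_configs: []")}).Start(3)` counts two failures, applies the config once on the first successful sync, and skips the final sync because the hash is unchanged.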

Testing:
Manually tested. The logs below show that when the first sync fails, the TA Manager keeps retrying.
TA Logs

@okankoAMZ okankoAMZ merged commit 4e2991d into target-allocator Oct 15, 2024
138 of 146 checks passed